Bellabeat is a high-tech manufacturer of health-focused products for
women. It is a successful small company with the potential to become a
larger player in the global smart device market. Urška Sršen,
cofounder and Chief Creative Officer of Bellabeat, believes that
analyzing smart device fitness data could help unlock new growth
opportunities for the company. She asked the marketing team to focus on
one of Bellabeat’s products and analyze smart device data to gain
insight into how consumers are using their smart devices.
In this case study, I take the role of a junior data analyst working
on the Bellabeat marketing team. I will present my analysis to the
Bellabeat executive team along with my high-level recommendations for
Bellabeat’s marketing strategy.
Identify trends in non-Bellabeat smart device usage, relate them to one Bellabeat product, and then use this information to provide high-level recommendations for how these trends can inform Bellabeat’s marketing strategy.
Sršen encourages the analytics team to use public data that explores smart device users’ daily habits. She points the team to the FitBit Fitness Tracker Data. This Kaggle dataset, made available through Mobius, contains personal fitness tracker data from thirty Fitbit users, including minute-level output for physical activity, heart rate, and sleep monitoring.
The FitBit Fitness Tracker Data was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. This is a third-party public dataset with a small sample size and no demographic information (including gender), and it is out of date, all of which could introduce bias. It still holds a lot of information about 30 Fitbit users that can be useful for our analysis.
I will carry out my analysis in RStudio, using R Markdown to document each step and to create this notebook.
install.packages("tidyverse")
install.packages("plotly")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(lubridate)
## Loading required package: timechange
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
daily_activity <- read_csv("/Users/tohidshokati/Desktop/Google data analysis/Case study/Bellabeat/Fitabase/dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
daily_sleep <- read_csv("/Users/tohidshokati/Desktop/Google data analysis/Case study/Bellabeat/Fitabase/sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hourly_steps <- read_csv("/Users/tohidshokati/Desktop/Google data analysis/Case study/Bellabeat/Fitabase/hourlySteps_merged.csv")
## Rows: 22099 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, StepTotal
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
heart_rate <- read_csv("/Users/tohidshokati/Desktop/Google data analysis/Case study/Bellabeat/Fitabase/heartrate_seconds_merged.csv")
## Rows: 2483658 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Time
## dbl (2): Id, Value
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
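The four `read_csv()` calls above repeat the same long folder path. As a minimal sketch, assuming the files all live in one folder, a small helper can build each path with `file.path()` and silence the column-specification message with `show_col_types = FALSE`. The helper name `read_fitabase` and the demo file are hypothetical and exist only for illustration; the two demo rows are made-up values, not the real data.

```r
library(readr)

# Hypothetical helper: read one Fitabase export from a single data directory.
read_fitabase <- function(file_name, data_dir) {
  read_csv(file.path(data_dir, file_name), show_col_types = FALSE)
}

# Demonstration on a throwaway file so the sketch runs anywhere.
# The values below are invented for illustration only.
demo_dir <- tempdir()
write_csv(
  data.frame(
    Id = c(1, 2),
    ActivityDate = c("4/12/2016", "4/13/2016"),
    TotalSteps = c(100, 200)
  ),
  file.path(demo_dir, "demo_activity.csv")
)

demo_activity <- read_fitabase("demo_activity.csv", demo_dir)
demo_activity
```

With the real data, `data_dir` would point at the Fitabase folder and the helper would be called once per file.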
head(daily_activity)
## # A tibble: 6 × 15
##       Id Activ…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸ Seden…⁹
##    <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 1.50e9 4/12/2…   13162    8.5     8.5        0    1.88   0.550    6.06       0
## 2 1.50e9 4/13/2…   10735    6.97    6.97       0    1.57   0.690    4.71       0
## 3 1.50e9 4/14/2…   10460    6.74    6.74       0    2.44   0.400    3.91       0
## 4 1.50e9 4/15/2…    9762    6.28    6.28       0    2.14   1.26     2.83       0
## 5 1.50e9 4/16/2…   12669    8.16    8.16       0    2.71   0.410    5.04       0
## 6 1.50e9 4/17/2…    9705    6.48    6.48       0    3.19   0.780    2.51       0
## # … with 5 more variables: VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>, and
## #   abbreviated variable names ¹ActivityDate, ²TotalSteps, ³TotalDistance,
## #   ⁴TrackerDistance, ⁵LoggedActivitiesDistance, ⁶VeryActiveDistance,
## #   ⁷ModeratelyActiveDistance, ⁸LightActiveDistance, ⁹SedentaryActiveDistance
head(daily_sleep)
## # A tibble: 6 × 5
##           Id SleepDay              TotalSleepRecords TotalMinutesAsleep TotalT…¹
##        <dbl> <chr>                             <dbl>              <dbl>    <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327      346
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384      407
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412      442
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340      367
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700      712
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304      320
## # … with abbreviated variable name ¹TotalTimeInBed
head(heart_rate)
## # A tibble: 6 × 3
##           Id Time                 Value
##        <dbl> <chr>                <dbl>
## 1 2022484408 4/12/2016 7:21:00 AM    97
## 2 2022484408 4/12/2016 7:21:05 AM   102
## 3 2022484408 4/12/2016 7:21:10 AM   105
## 4 2022484408 4/12/2016 7:21:20 AM   103
## 5 2022484408 4/12/2016 7:21:25 AM   101
## 6 2022484408 4/12/2016 7:22:05 AM    95
head(hourly_steps)
## # A tibble: 6 × 3
##           Id ActivityHour          StepTotal
##        <dbl> <chr>                     <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM       373
## 2 1503960366 4/12/2016 1:00:00 AM        160
## 3 1503960366 4/12/2016 2:00:00 AM        151
## 4 1503960366 4/12/2016 3:00:00 AM          0
## 5 1503960366 4/12/2016 4:00:00 AM          0
## 6 1503960366 4/12/2016 5:00:00 AM          0
Let’s see how many unique participants there are in each data frame. It looks like there may be more participants in the daily activity dataset than in the sleep dataset.
n_distinct(daily_activity$Id)
## [1] 33
n_distinct(daily_sleep$Id)
## [1] 24
n_distinct(heart_rate$Id)
## [1] 14
n_distinct(hourly_steps$Id)
## [1] 33
There are 33 participants in the daily activity and hourly steps data frames, 24 in daily sleep, and only 14 in the heart rate data frame.
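The same participant count can be produced for every data frame in one pass. The sketch below is only an illustration on made-up `Id` values: it puts toy stand-ins for the data frames into a named list and applies `n_distinct()` to each. The list name `datasets` and the toy values are assumptions, not part of the original analysis.

```r
library(dplyr)

# Toy stand-ins for the real data frames (invented Ids).
datasets <- list(
  daily_activity = data.frame(Id = c(1, 1, 2, 3)),
  daily_sleep    = data.frame(Id = c(1, 2, 2)),
  heart_rate     = data.frame(Id = c(2, 2, 2))
)

# One named count per data frame.
participants <- sapply(datasets, function(df) n_distinct(df$Id))
participants
```

With the real objects, the list would hold `daily_activity`, `daily_sleep`, `heart_rate`, and `hourly_steps`.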
First of all, I would like to check for duplicated observations.
sum(duplicated(daily_activity))
## [1] 0
sum(duplicated(daily_sleep))
## [1] 3
sum(duplicated(heart_rate))
## [1] 0
sum(duplicated(hourly_steps))
## [1] 0
There are 3 duplicated rows in the daily sleep data set. I am going to remove them.
daily_sleep <- distinct(daily_sleep)
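Before dropping duplicates it can help to look at them and to confirm afterwards that the row count fell by exactly the number found. This is a small sketch on invented rows (the `toy_sleep` data frame is hypothetical), using the same `duplicated()` and `distinct()` functions as above.

```r
library(dplyr)

# Invented rows: the second row repeats the first exactly.
toy_sleep <- data.frame(
  Id = c(1, 1, 1, 2),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/12/2016 12:00:00 AM",
               "4/13/2016 12:00:00 AM", "4/12/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(300, 300, 320, 410)
)

# Inspect the duplicated rows before removing them.
duplicate_rows <- toy_sleep[duplicated(toy_sleep), ]
duplicate_rows

# Remove them and check that the row count dropped by the expected amount.
toy_clean <- distinct(toy_sleep)
stopifnot(nrow(toy_clean) == nrow(toy_sleep) - nrow(duplicate_rows))
```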
The dates in all four datasets were read as strings (chr) and need to be converted to a date format before starting the analysis. I will also rename these columns to date for consistency.
daily_activity <- daily_activity %>%
  mutate(ActivityDate = mdy(ActivityDate)) %>%
  rename(date = ActivityDate)
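The time stamps in the daily sleep, hourly steps, and heart rate data frames were read as strings too. I will convert them to date-time format and then split them into date and time columns. Every SleepDay value is stamped at midnight, so splitting daily_sleep leaves its time column as NA and triggers the warning shown below.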
daily_sleep$SleepDay=as.POSIXct(daily_sleep$SleepDay, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
daily_sleep <- separate(daily_sleep, SleepDay, into=c('date', 'time'), sep=' ', remove=TRUE) %>%
mutate(date=as_date(date), time=hms::as_hms(time))
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 410 rows [1, 2,
## 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
hourly_steps$ActivityHour=as.POSIXct(hourly_steps$ActivityHour, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
hourly_steps <- separate(hourly_steps, ActivityHour, into=c('date', 'time'), sep=' ', remove=TRUE) %>%
mutate(date=as_date(date), time=hms::as_hms(time))
heart_rate$Time=as.POSIXct(heart_rate$Time, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
heart_rate <- separate(heart_rate, Time, into=c('date', 'time'), sep =' ', remove=TRUE) %>%
mutate(date=as_date(date), time=hms::as_hms(time))
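An alternative that avoids splitting text on a space is to parse the string once and derive both columns from the parsed value. The sketch below is an illustration on invented rows, not the method used above: it assumes `lubridate::mdy_hms()` for parsing (which returns UTC by default, whereas the code above uses `Sys.timezone()`), then `as_date()` and `hms::as_hms()` for the two columns. The `toy_steps` data frame and the intermediate `date_time` column are hypothetical names.

```r
library(dplyr)
library(lubridate)

# Invented rows in the same text format as ActivityHour.
toy_steps <- data.frame(
  Id = c(1, 1, 1),
  ActivityHour = c("4/12/2016 12:00:00 AM",
                   "4/12/2016 1:00:00 PM",
                   "4/13/2016 11:00:00 PM"),
  StepTotal = c(373, 160, 151)
)

toy_steps <- toy_steps %>%
  mutate(
    date_time = mdy_hms(ActivityHour),  # parses the AM/PM text, UTC by default
    date = as_date(date_time),          # calendar date
    time = hms::as_hms(date_time)       # time of day, midnight included
  ) %>%
  select(Id, date, time, StepTotal)

toy_steps
```

Because the time of day is taken from the parsed value rather than from its printed form, a midnight time stamp comes through as 00:00:00 instead of going missing.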
I would like to confirm the format corrections by running the str() function.
str(daily_activity)
## tibble [940 × 15] (S3: tbl_df/tbl/data.frame)
## $ Id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ date : Date[1:940], format: "2016-04-12" "2016-04-13" ...
## $ TotalSteps : num [1:940] 13162 10735 10460 9762 12669 ...
## $ TotalDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : num [1:940] 728 776 1218 726 773 ...
## $ Calories : num [1:940] 1985 1797 1776 1745 1863 ...
str(daily_sleep)
## tibble [410 × 6] (S3: tbl_df/tbl/data.frame)
## $ Id : num [1:410] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ date : Date[1:410], format: "2016-04-12" "2016-04-13" ...
## $ time : 'hms' num [1:410] NA NA NA NA ...
## ..- attr(*, "units")= chr "secs"
## $ TotalSleepRecords : num [1:410] 1 2 1 2 1 1 1 1 1 1 ...
## $ TotalMinutesAsleep: num [1:410] 327 384 412 340 700 304 360 325 361 430 ...
## $ TotalTimeInBed : num [1:410] 346 407 442 367 712 320 377 364 384 449 ...
str(heart_rate)
## tibble [2,483,658 × 4] (S3: tbl_df/tbl/data.frame)
## $ Id : num [1:2483658] 2.02e+09 2.02e+09 2.02e+09 2.02e+09 2.02e+09 ...
## $ date : Date[1:2483658], format: "2016-04-12" "2016-04-12" ...
## $ time : 'hms' num [1:2483658] 07:21:00 07:21:05 07:21:10 07:21:20 ...
## ..- attr(*, "units")= chr "secs"
## $ Value: num [1:2483658] 97 102 105 103 101 95 91 93 94 93 ...
str(hourly_steps)
## tibble [22,099 × 4] (S3: tbl_df/tbl/data.frame)
## $ Id : num [1:22099] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ date : Date[1:22099], format: "2016-04-12" "2016-04-12" ...
## $ time : 'hms' num [1:22099] 00:00:00 01:00:00 02:00:00 03:00:00 ...
## ..- attr(*, "units")= chr "secs"
## $ StepTotal: num [1:22099] 373 160 151 0 0 ...
Now that the data is cleaned, I’m ready to analyze the data sets.
Let’s start analyzing our data with a sneak peek into summary statistics.
daily_activity %>%
  select(TotalSteps, TotalDistance, SedentaryMinutes, Calories) %>%
  summary()
##    TotalSteps    TotalDistance    SedentaryMinutes    Calories
##  Min.   :    0   Min.   : 0.000   Min.   :   0.0   Min.   :   0
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8   1st Qu.:1828
##  Median : 7406   Median : 5.245   Median :1057.5   Median :2134
##  Mean   : 7638   Mean   : 5.490   Mean   : 991.2   Mean   :2304
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5   3rd Qu.:2793
##  Max.   :36019   Max.   :28.030   Max.   :1440.0   Max.   :4900
daily_sleep %>%
  select(TotalMinutesAsleep, TotalSleepRecords, TotalTimeInBed) %>%
  summary()
##  TotalMinutesAsleep TotalSleepRecords TotalTimeInBed
##  Min.   : 58.0      Min.   :1.00      Min.   : 61.0
##  1st Qu.:361.0      1st Qu.:1.00      1st Qu.:403.8
##  Median :432.5      Median :1.00      Median :463.0
##  Mean   :419.2      Mean   :1.12      Mean   :458.5
##  3rd Qu.:490.0      3rd Qu.:1.00      3rd Qu.:526.0
##  Max.   :796.0      Max.   :3.00      Max.   :961.0
heart_rate %>%
  select(Value) %>%
  summary()
##      Value
##  Min.   : 36.00
##  1st Qu.: 63.00
##  Median : 73.00
##  Mean   : 77.33
##  3rd Qu.: 88.00
##  Max.   :203.00
hourly_steps %>%
  select(StepTotal) %>%
  summary()
##    StepTotal
##  Min.   :    0.0
##  1st Qu.:    0.0
##  Median :   40.0
##  Mean   :  320.2
##  3rd Qu.:  357.0
##  Max.   :10554.0
ggplot(data = daily_activity, mapping = aes(x = TotalSteps, y = Calories)) +
  geom_point(color = "blue") +
  geom_smooth(color = "black") +
  labs(title = "FitBit Tracker Data", subtitle = "Total Steps vs. Calories",
       x = "Total Steps", y = "Calories")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The number of steps is clearly positively correlated with the number of calories burned. Let’s take a look at the relationship between time spent in bed and total sleep time per day.
ggplot(data = daily_sleep, mapping = aes(x = TotalMinutesAsleep, y = TotalTimeInBed)) +
  geom_point(color = "blue") +
  geom_smooth(color = "black") +
  labs(title = "FitBit Tracker Data", subtitle = "Total Sleep vs. Total Time In Bed",
       x = "Total Asleep (Minutes)", y = "Total Time In Bed (Minutes)")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
As expected, the relationship is almost perfectly linear.
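The visual impression from a scatter plot can be backed by a number. As a rough sketch, base R's `cor()` and `cor.test()` give the correlation coefficient and a significance test. The data below is simulated purely for illustration (it is NOT the Fitbit data), and the slope and noise level are arbitrary assumptions.

```r
# Simulated stand-in data: calories rise with steps, plus random noise.
set.seed(42)
toy <- data.frame(TotalSteps = round(runif(200, min = 0, max = 20000)))
toy$Calories <- 1500 + 0.08 * toy$TotalSteps + rnorm(200, sd = 300)

# Pearson measures linear association; Spearman measures monotonic association.
pearson  <- cor(toy$TotalSteps, toy$Calories)
spearman <- cor(toy$TotalSteps, toy$Calories, method = "spearman")

# cor.test() adds a confidence interval and a p-value for the Pearson estimate.
fit <- cor.test(toy$TotalSteps, toy$Calories)

pearson
spearman
fit$conf.int
```

On the real data the same calls would take `daily_activity$TotalSteps` and `daily_activity$Calories`, or `daily_sleep$TotalMinutesAsleep` and `daily_sleep$TotalTimeInBed`.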
hourly_steps %>%
  group_by(time) %>%
  summarize(average_steps = mean(StepTotal)) %>%
  ggplot() +
  geom_col(mapping = aes(x = time, y = average_steps, fill = average_steps)) +
  labs(title = "FitBit Tracker Data", subtitle = "Hourly Steps Per Day", x = "Time", y = "Average Steps") +
  scale_fill_gradient(low = "black", high = "navy", name = "Average Steps") +
  theme(axis.text.x = element_text(angle = 45))
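The hourly averages behind a chart like this can also be ranked directly, which makes the busiest hours easy to read off without judging bar heights. The sketch below uses invented hourly records for two hypothetical users and `dplyr::slice_max()` to keep the top two hours; the `toy_hourly` data frame is an assumption for illustration.

```r
library(dplyr)

# Invented hourly records for two hypothetical users.
toy_hourly <- data.frame(
  Id = rep(c(1, 2), each = 4),
  time = rep(c("08:00:00", "12:00:00", "18:00:00", "23:00:00"), times = 2),
  StepTotal = c(200, 600, 900, 50, 300, 500, 700, 10)
)

# Average steps per hour of day, then keep the two most active hours.
peak_hours <- toy_hourly %>%
  group_by(time) %>%
  summarize(average_steps = mean(StepTotal)) %>%
  slice_max(average_steps, n = 2)

peak_hours
```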
It’s clear that our participants are most active between 5 PM and 7 PM;
they probably go to the gym or take a walk after work. It’s interesting to
see that 11 AM to 2 PM are very active hours as well. Now, I would like to
explore the relationship between exercise and sleep. To check whether
there is any correlation between them, I need to join the
daily_activity and daily_sleep data sets.
daily_activity_sleep <- merge(daily_activity, daily_sleep, by=c('Id', 'date'))
Let’s take a look at our merged data set.
str(daily_activity_sleep)
## 'data.frame': 410 obs. of 19 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ date : Date, format: "2016-04-12" "2016-04-13" ...
## $ TotalSteps : num 13162 10735 9762 12669 9705 ...
## $ TotalDistance : num 8.5 6.97 6.28 8.16 6.48 ...
## $ TrackerDistance : num 8.5 6.97 6.28 8.16 6.48 ...
## $ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num 1.88 1.57 2.14 2.71 3.19 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 1.26 0.41 0.78 ...
## $ LightActiveDistance : num 6.06 4.71 2.83 5.04 2.51 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : num 25 21 29 36 38 50 28 19 41 39 ...
## $ FairlyActiveMinutes : num 13 19 34 10 20 31 12 8 21 5 ...
## $ LightlyActiveMinutes : num 328 217 209 221 164 264 205 211 262 238 ...
## $ SedentaryMinutes : num 728 776 726 773 539 775 818 838 732 709 ...
## $ Calories : num 1985 1797 1745 1863 1728 ...
## $ time : 'hms' num NA NA NA NA ...
## ..- attr(*, "units")= chr "secs"
## $ TotalSleepRecords : num 1 2 1 2 1 1 1 1 1 1 ...
## $ TotalMinutesAsleep : num 327 384 412 340 700 304 360 325 361 430 ...
## $ TotalTimeInBed : num 346 407 442 367 712 320 377 364 384 449 ...
As expected for an inner join of the two data sets, the result has 410 observations.
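An inner join silently drops activity days that have no sleep record. As a small illustration on invented rows, `dplyr::inner_join()` performs the same kind of join as the default `merge()`, and `anti_join()` lists the rows that were left out. The toy data frames below are hypothetical.

```r
library(dplyr)

# Invented rows: user 1 has two activity days, but only one has a sleep record;
# user 3 has no sleep record at all.
toy_activity <- data.frame(
  Id = c(1, 1, 2, 3),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12", "2016-04-12")),
  TotalSteps = c(13162, 10735, 9762, 12669)
)
toy_sleep <- data.frame(
  Id = c(1, 2),
  date = as.Date(c("2016-04-12", "2016-04-12")),
  TotalMinutesAsleep = c(327, 400)
)

# Rows present in both tables.
joined <- inner_join(toy_activity, toy_sleep, by = c("Id", "date"))

# Activity rows with no matching sleep record (dropped by the inner join).
unmatched <- anti_join(toy_activity, toy_sleep, by = c("Id", "date"))

joined
unmatched
```

Checking the unmatched rows shows how much of the activity data is excluded from any analysis built on the merged table.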
ggplot(data = daily_activity_sleep, mapping = aes(x = TotalSteps, y = TotalMinutesAsleep)) +
  geom_point(color = "blue") +
  geom_smooth(color = "black") +
  labs(title = "FitBit Tracker Data", subtitle = "Total Steps vs. Total Sleep Time",
       x = "Total Steps Per Day", y = "Total Sleep Time (Minutes)")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
It looks like there is no significant correlation between total daily steps and sleep duration. Let’s check the relationship between sedentary time and sleep.
ggplot(data = daily_activity_sleep, mapping = aes(y = SedentaryMinutes, x = TotalMinutesAsleep)) +
  geom_point(color = "blue") +
  geom_smooth(color = "black") +
  labs(title = "FitBit Tracker Data", subtitle = "Sedentary Time vs. Total Sleep Time",
       y = "Sedentary Minutes Per Day", x = "Total Sleep Time (Minutes)")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
There is a negative correlation between sedentary time and
sleep: people with more sedentary minutes per day tend to
sleep less at night. A fitness smart device like Fitbit or the Bellabeat
Leaf can encourage users to exercise for a healthier lifestyle.
I realized that some participants didn’t wear their smart device on some
days. Now I’m curious to know what the smart device usage percentage is
among the owners, counting the days each participant has in the merged data set.
usage <- daily_activity_sleep %>%
  group_by(Id) %>%
  summarize(worn_days = n()) %>%
  mutate(fitbit_usage = case_when(
    worn_days >= 1 & worn_days <= 6 ~ "Very low usage",
    worn_days >= 7 & worn_days <= 12 ~ "Low usage",
    worn_days >= 13 & worn_days <= 18 ~ "Moderate usage",
    worn_days >= 19 & worn_days <= 24 ~ "High usage",
    worn_days >= 25 & worn_days <= 31 ~ "Very high usage"))
usage_percentage <- usage %>%
  count(fitbit_usage, name = "total_usage_type") %>%
  mutate(percentage = total_usage_type * 100 / sum(total_usage_type)) %>%
  select(fitbit_usage, percentage)
Let’s take a look at our percentage tibble.
print(usage_percentage)
## # A tibble: 5 × 2
##   fitbit_usage    percentage
##   <chr>                <dbl>
## 1 High usage            8.33
## 2 Low usage             4.17
## 3 Moderate usage       12.5
## 4 Very high usage      41.7
## 5 Very low usage       33.3
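The tibble above lists the usage groups alphabetically because the labels are plain strings. One possible alternative, sketched here on invented day counts, is to bin with base R's `cut()`, which returns a factor whose levels keep the low-to-high order. The break points mirror the ranges used above (1 to 6, 7 to 12, and so on); the `toy_usage` data frame and its values are assumptions for illustration.

```r
library(dplyr)

# Invented day counts for six hypothetical users.
toy_usage <- data.frame(Id = 1:6, worn_days = c(3, 10, 15, 20, 28, 31))

usage_share <- toy_usage %>%
  mutate(fitbit_usage = cut(
    worn_days,
    # Right-closed bins: (0,6], (6,12], (12,18], (18,24], (24,31]
    breaks = c(0, 6, 12, 18, 24, 31),
    labels = c("Very low usage", "Low usage", "Moderate usage",
               "High usage", "Very high usage")
  )) %>%
  count(fitbit_usage, name = "users") %>%
  mutate(percentage = 100 * users / sum(users))

usage_share
```

Because `fitbit_usage` is a factor here, tables and charts built from it follow the usage order rather than the alphabet.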
Let’s make a pie chart to visualize this data. I’m going to load the
plotly package to create our pie chart.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# Take labels and values straight from the tibble instead of retyping them.
labels <- usage_percentage$fitbit_usage
values <- usage_percentage$percentage
fig <- plot_ly(type='pie', labels=labels, values=values,
textinfo='label+percent',
insidetextorientation='radial') %>%
layout(title = 'Smart device usage per month by owners')
fig
I appreciate your interest in my project. This is my first ever case study in data analytics, and I’m eager to hear any comments or recommendations about it.